set automatically allocates new column slots if needed#5269
set automatically allocates new column slots if needed#5269ben-schwen wants to merge 33 commits intomasterfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5269 +/- ##
=======================================
Coverage 98.59% 98.60%
=======================================
Files 79 79
Lines 14685 14691 +6
=======================================
+ Hits 14479 14486 +7
+ Misses 206 205 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Looks good in that this solution works and it's a good issue to fix. Just wondering about any performance decrease, particularly when updating existing columns by name inside a loop. When |
Benchmark - Updating existing columnsSetupI used two different installations of R but both with version 4.1.1. And put everything into a script I call twice. First line is always the benchmark with an integer set, second one with a character set. Scriptlibrary(data.table)
# ensure that if branch hits
options(datatable.alloccol=1024L)
set.seed(373)
DT = as.data.table(matrix(rnorm(1e4*1e4), ncol=1e4))
cols = lapply(1:100, function(i) sample(ncol(DT), 2e3))
colsns = lapply(1:100, function(i) colnames(DT)[sample(ncol(DT), 2e3)])
R.home()
microbenchmark::microbenchmark(times=10,
for (i in cols) {set(DT, j=i, value=DT$V1)},
for (i in colsns) {set(DT, j=i, value=DT$V1)}
)100 set / 2000 cols eachCurrent (
|
|
@mattdowle I tried to move everything down to the C level, hence, we save the 2nd What is still open is the question of whether we can also save the In case this is possible, we should change it also at |
R/data.table.R
Outdated
| name = substitute(x) | ||
| x = .Call(Cassign,x,i,j,NULL,value) | ||
| if (is.name(name)) | ||
| assign(as.character(name),x,parent.frame(),inherits=TRUE) |
There was a problem hiding this comment.
This assign() looks pretty complicated to reason about, I would like to see more tests of robustness here (setDT() within a function, places where inherits=TRUE is important, etc.).
At a glance I can't tell why this is needed though -- it definitely warrants a comment explaining why we need to branch.
There was a problem hiding this comment.
I simplified now the whole code. I also added a comment why we need the assign.
Ultimately it would be nice to let set work also in the cases where it needs to overallocate and is inside a function, howeover, for making this to work we would need to find the right environment where to assign (might not be the first where x is present) and it is not clear whether it would be the last due to scoping
|
Generated via commit 4c5b1a2 Download link for the artifact containing the test results: ↓ atime-results.zip
|
| { | ||
| .Call(Cassign,x,i,j,NULL,value) | ||
| set = function(x, i=NULL, j, value) { | ||
| name = as.character(substitute(x)) |
There was a problem hiding this comment.
set(x) expects that x is a data.table.
We already use a similar approach inside := to assign the name.

Closes #1831
Closes #4100 (what's left is the duplicate of #496)
Towards #496 establishing consistent behavior between
setand:=Towards #678 closing
oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.